Abstract

Background: Accurate recognition of regulatory elements in promoters is an essential prerequisite for 
understanding the mechanisms of gene regulation at the level of transcription. Composite regulatory elements 
represent a particular type of such transcriptional regulatory elements consisting of pairs of individual DNA motifs. 
In contrast to the present approach, most available recognition techniques are based purely on statistical evaluation 
of the occurrence of single motifs. Such methods are limited in application, since the accuracy of recognition is 
greatly dependent on the size and quality of the sequence dataset. Methods that exploit available knowledge and 
have broad applicability are evidently needed. 

Results: We developed a novel method to identify composite regulatory elements in promoters using a library of
known examples. In-depth investigation of regularities encoded in known composite elements allowed us to introduce
a new characteristic measure and to improve the specificity compared with other methods. Tests on an established
benchmark and real genomic data show that our method outperforms other available methods based either on
known examples or statistical evaluations. In addition to better recognition, the method has two practical advantages: first,
the ability to detect a high number of different types of composite elements, and second, direct biological
interpretation of the identified results. The program is available at http://gnaweb.helmholtz-hzi.de/cgi-bin/MCatch/MatrixCatch.pl
and includes an option to extend the provided library by user supplied data.

Conclusions: The novel algorithm for the identification of composite regulatory elements presented in this paper
proved superior to existing methods. Its application to tissue specific promoters identified several highly specific
composite elements with relevance to their biological function. This approach, together with other methods, will further
advance the understanding of transcriptional regulation of genes.



Background 

Deciphering the mechanisms of transcriptional regulation
of gene expression is one of the key problems biologists
are facing. It is widely accepted to date that genes,
especially in higher eukaryotes, are regulated by a
combination of transcription factors (TFs) bound to their
cognate DNA sites, rather than by a single factor.
Therefore, extensive research is conducted on combinatorial
interactions of protein factors and their DNA binding
sites (BSs) with respect to the transcriptional activity of
affected genes. The majority of present methods evaluate
the statistical properties of motif pairs (for review see
[1]) or multiple combinations of motifs [2]. Some
methods use comparisons with existing examples of
motif combinations as a basis for recognition [3-6].

The minimal functional unit that can provide combinatorial
regulation is a composite element (CE). Structurally,
a CE consists of two closely located BSs for distinct
transcription factors. Functionally, however, CEs are
considered single elements, since their regulatory functions
are qualitatively different from the regulatory effects of
either individual BS [7,8]. The function, structure and primary
sequence of CEs are studied in a number of different
experiments, in particular to confirm protein-protein
interactions and cooperative binding to DNA, as well as effects
on transcriptional regulation. Such data on CEs can be
found in databases such as TRANSCompel [9].
The major problem in developing general recognition
methods for CEs lies in the extremely limited number of
experimentally defined CEs. For particular types of CEs
some ad hoc methods have been suggested [3-5]. However,
the method that can identify many types of CEs
[6] shows relatively poor recognition characteristics.

The basic idea of the current method is to complement
existing knowledge on experimentally identified
and functionally described CEs with data available for the
single BSs constituting the CEs. We demonstrate that such
an integrative approach is able to model the heterogeneity
of CEs, which results in good recognition characteristics
of the method. We also show that the existing
variety of CEs is in no way a limiting factor for the
applicability of the method. Quite the contrary, MatrixCatch
with the provided library outperformed all statistical
methods, which to date attract excessive attention from the
bioinformatics community. Elements of crowdsourcing
were implemented in the website to allow further extension
of the existing CE library.

Methods 

Matrix model of CE 

The idea behind MatrixCatch is to compensate for the lack
of knowledge on sequence variation of each DNA BS in
CEs by recruiting data collected for the respective BSs
separately from each other. Such information is compiled in
position weight matrices (PWMs). Each CE serves as
a template for a model, which consists of two PWMs, as
well as their minimal scores, relative orientation and
distance. Thus, PWMs, which are built using many single
BSs, define the sequence variability of the BSs in the CE.
Minimal scores for PWMs, orientation and distance between
PWMs are determined by the CE itself.

Building the CE model 

First, PWMs related to the first binding TF are selected
from the entire TRANSFAC library (in case there are several).
Here and further we call the BSs in a CE model "first" and
"second" in accordance with the database annotation.
Second, PWM scores are calculated for both orientations
at the position of the first annotated BS in the CE for all
selected PWMs. Third, the combination of PWM, its score
and its orientation that delivers the lowest prediction
rate on random sequences is selected. Often, but not always,
it is the PWM with the highest score. This score becomes
the minimal required PWM score $S_{m1}$ in the model
for the first BS. After repeating the same three steps for
the second BS, all the parameters of the CE model are
identified: $PWM_1$, $PWM_2$, their orientations, minimal
scores $S_{m1}$, $S_{m2}$ and in-between distance $D_m$.
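The three selection steps and the resulting model parameters can be sketched in Python as follows. This is only an illustration under stated assumptions, not the MatrixCatch implementation: the names `SiteCandidate`, `CEModel`, `pick_site` and `build_model` are hypothetical, and the PWM scores and random-sequence hit rates are assumed to have been computed beforehand by a PWM scanner.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SiteCandidate:
    pwm_id: str         # hypothetical PWM identifier
    orientation: str    # "+" or "-"
    score: float        # PWM score at the annotated BS position
    random_rate: float  # prediction rate on random sequences at this score


@dataclass(frozen=True)
class CEModel:
    pwm1: str
    orient1: str
    s_m1: float  # minimal required score of the first BS
    pwm2: str
    orient2: str
    s_m2: float  # minimal required score of the second BS
    d_m: int     # distance between the two BSs in the known CE


def pick_site(candidates: List[SiteCandidate]) -> SiteCandidate:
    """Step three: keep the candidate with the lowest random prediction rate."""
    return min(candidates, key=lambda c: c.random_rate)


def build_model(first: List[SiteCandidate],
                second: List[SiteCandidate],
                distance: int) -> CEModel:
    """Repeat the selection for both BSs and collect the model parameters."""
    a, b = pick_site(first), pick_site(second)
    return CEModel(a.pwm_id, a.orientation, a.score,
                   b.pwm_id, b.orientation, b.score, distance)


# Toy candidates: the higher-scoring PWM is not selected, because the other
# one gives fewer predictions on random sequences.
first = [SiteCandidate("PWM_A", "+", 0.95, 1e-3),
         SiteCandidate("PWM_B", "-", 0.90, 2e-4)]
second = [SiteCandidate("PWM_C", "+", 0.88, 5e-4)]
model = build_model(first, second, distance=7)
```

Selecting by random prediction rate rather than by raw score mirrors the remark that the chosen PWM is often, but not always, the one with the highest score.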

On this basis, we built 265 matrix models for all CEs
collected in the TRANSCompel database. To search for
potential CEs, MatrixCatch tests these models on a
DNA sequence. To be able to reveal "non perfect
matches", model parameters like PWM scores ($S_{m1}$, $S_{m2}$)
and distance ($D_m$) should be relaxed. To increase the
specificity of the search we introduced a "composite
score" (CS). As will be shown later, this composite
score provides higher recognition accuracy in comparison
to existing methods.

Dependence between binding sites in CEs 

It was observed that the combination "one BS with low
PWM score - another with high PWM score" in real CEs
is more frequent than "low - low" (the distribution of
PWM scores in the constructed CE models can be seen
in Figure 1A). The Pearson correlation coefficient calculated
for PWM scores equals -0.164 (p-value 0.003), indicating
a negative correlation between matrix scores within
one CE. To test the statistical relevance of this observation,
we investigated the distribution of PWM scores
($S_{m1}$, $S_{m2}$) in matrix models of "random CEs". Random
sequence CEs were obtained from real CEs by
reshuffling their DNA sequences. Matrix models for random
CEs were constructed following the same procedure
as for real CEs. The procedure with random CEs
was repeated 4 times, generating 1060 models. The Pearson
correlation in this case was only -0.0088 (p-value 0.39).
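The shuffling control and the correlation measure can be sketched as below. This is a minimal illustration, assuming plain mononucleotide shuffling of the whole CE sequence; `shuffled_copies` and `pearson` are hypothetical helper names, the toy sequence is invented, and building the matrix models for the shuffled sequences is not shown.

```python
import random
from math import sqrt


def shuffled_copies(seq, n_copies=4, seed=0):
    """Return n_copies random permutations of the nucleotides in seq."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        chars = list(seq)
        rng.shuffle(chars)
        copies.append("".join(chars))
    return copies


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy sequence, not a real CE: four shuffled copies per element.
toy_ce = "ATGCGTACGTTAGC"
random_ces = shuffled_copies(toy_ce)
```

With four shuffled copies for each of the 265 real CEs, this procedure yields the 1060 random models mentioned in the text; `pearson` would then be applied to the two lists of model scores.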

The accuracy of the recognition method will benefit
when such mutual dependence of BSs is taken into account.
From Figure 1A it becomes obvious that a better
separation of real and random CEs cannot be achieved by
vertical or horizontal lines but rather by a diagonal. The
diagonal corresponds to the sum of PWM scores, whereas
vertical and horizontal lines are minimal scores for both
BSs separately. The combination of restrictions on the scores of
both BSs individually (lines A'B' and B'C' in Figure 1A)
and on their sum (line EF) is one of the key points of the
method and is formally described in equation (4).

Recognition rule 

Mathematically this approach can be described as follows.
The diagonal, or the absolute value of the composite
score, is defined by:

$$absCS = S_{m1} + S_{m2}, \qquad (1)$$

where $S_{m1}$, $S_{m2}$ are PWM scores defined by the CE
model.

For the purpose of recognition we will use relative
values for the composite score:

$$relCS = \frac{S_{m1} - S_1}{S_{m1}} + \frac{S_{m2} - S_2}{S_{m2}}, \qquad (2)$$

where $S_1$, $S_2$ are the actual matching scores of PWMs on
an investigated DNA sequence. It is notable that relCS
may adopt negative values when one or both BSs of a
potential CE have higher PWM scores than defined by the
model ($S_1 > S_{m1}$ and/or $S_2 > S_{m2}$). In such cases we say
that the potential CE matches the model better than
minimally required. Alternatively, another BS may have a
lower PWM score than required by the model, which
corresponds to the "high-low" phenomenon described above.

Figure 1 Distributions of PWM scores and distances between BSs in real and random CEs. (A) Distribution of PWM scores for first and
second BSs in real CEs (red) and random sequence CEs (blue). Scores $S_{m1}$ and $S_{m2}$ define the rectangle OABC and perfectly separate high scoring
CEs. By reducing the scores (dashed green lines), many additional true CEs, but also a large number of random CEs, are covered by the
rectangle OA'B'C'. Introduction of a sum of scores (diagonal EF) greatly improves the separation between real and random CEs (discontinuous
line A'E'F'O). (B) Distribution of distances between BSs and sum of matrix scores in real CEs (blue). Distance values were averaged in intervals of
score values (1.75-1.80), (1.80-1.85), (1.85-1.90), (1.90-1.95) and (1.95-2.00) (red). The trend line reflects the dependence between PWM scores and
distance between BSs. Axis labels: (A) minimal PWM score of the first BS, $S_{m1}$; (B) sum of matrix scores in CE model, $S_{m1}+S_{m2}$.

To account for the relative positioning of BSs in a CE we
add a third term to (2):

$$CS = \frac{S_{m1} - S_1}{S_{m1}} + \frac{S_{m2} - S_2}{S_{m2}} + \lambda |D_m - D|, \qquad (3)$$

where $D$ is the actual distance between identified BSs
and $D_m$ is the distance defined by the CE model.

Considering the physics of DNA-protein and protein-protein
interactions, it can be suggested that remotely
located BSs might both have higher affinity to their TFs
compared to closely located ones. Despite the fact that
DNA may form loops and BSs distant in sequence may
become close in 3D, we found this suggestion relevant
and subjected it to verification.

Using all matrix models of CEs, the distribution of
distances between BSs ($D_m$) and the absCS was calculated
(Figure 1B). Averaged distances between BSs show that
CEs that have longer distances between BSs have on
average a higher absCS. The linear regression coefficient
between distance and sum of scores equals 53.62, with a
90% confidence interval of (40.9, 66.2) and a 95% confidence
interval of (38.5, 68.7). The t-score of this regression
is 7.6 with a p-value of 0.004. Therefore, our assumption of a
dependence between distance and quality of BSs within a CE
can be regarded as statistically relevant.

To make our method more stringent we considered
both positive and negative fluctuations of the distance $D$
around the model distance $D_m$ as unfavorable. Coefficient $\lambda$ in (3) was set
to the inverse of the slope value of the trend line (1/53.62).
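How a distance weight follows from a trend line can be sketched with an ordinary least-squares fit. The data points below are invented purely for illustration (the published slope of 53.62 was obtained from the CE models, which are not reproduced here), and `trend_slope` is a hypothetical helper name.

```python
def trend_slope(xs, ys):
    """Least-squares slope of ys (distances) regressed on xs (score sums)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x


# Invented toy points lying on a line with slope 50 (distance per score unit).
score_sums = [1.75, 1.80, 1.85, 1.90]
distances = [7.5, 10.0, 12.5, 15.0]
slope = trend_slope(score_sums, distances)

# The weight of the distance term is the inverse slope, i.e. score units
# per nucleotide of distance deviation.
toy_weight = 1.0 / slope
published_weight = 1.0 / 53.62  # value quoted in the text
```

Taking the inverse converts a deviation in distance into the same units as the score terms, so that both kinds of deviation can be summed in one composite score.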

Finally, a DNA sequence is reported as a potential CE
when the following recognition rule holds true:

$$\begin{cases}
\dfrac{S_{m1} - S_1}{S_{m1}} \le R_1 \\
\dfrac{S_{m2} - S_2}{S_{m2}} \le R_2 \\
\dfrac{|D_m - D|}{D_m} \le R_D \\
CS \le R_{CS}
\end{cases} \qquad (4)$$

where $R_{CS}$, $R_1$, $R_2$ and $R_D$ are the relaxation parameters
for the composite score CS, PWM scores and the distance,
respectively. A maximum stringency search is
achieved with all these parameters set to 0.
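The composite score and the recognition rule can be sketched in Python as follows. This is a hedged illustration, not the MatrixCatch code. It assumes that each score term is the relative shortfall (S_m - S)/S_m, that the distance deviation is taken relative to D_m, and that all thresholds are non-strict, so that a perfect match passes when every relaxation parameter is 0. The names `Relaxation`, `composite_score` and `is_potential_ce` are hypothetical.

```python
from dataclasses import dataclass

LAMBDA = 1 / 53.62  # distance weight quoted in the text


@dataclass(frozen=True)
class Relaxation:
    r1: float = 0.0   # allowed relative shortfall of the first PWM score
    r2: float = 0.0   # allowed relative shortfall of the second PWM score
    rd: float = 0.0   # allowed relative deviation of the distance
    rcs: float = 0.0  # allowed composite score


def composite_score(s1, s2, d, s_m1, s_m2, d_m, lam=LAMBDA):
    """Relative score shortfalls plus the weighted distance deviation."""
    return (s_m1 - s1) / s_m1 + (s_m2 - s2) / s_m2 + lam * abs(d_m - d)


def is_potential_ce(s1, s2, d, s_m1, s_m2, d_m, relax=Relaxation()):
    """Check all four conditions of the rule (assumed non-strict)."""
    return (
        (s_m1 - s1) / s_m1 <= relax.r1
        and (s_m2 - s2) / s_m2 <= relax.r2
        # multiply instead of dividing so that d_m == 0 does not raise
        and abs(d_m - d) <= relax.rd * d_m
        and composite_score(s1, s2, d, s_m1, s_m2, d_m) <= relax.rcs
    )


# Toy model: S_m1 = 0.95, S_m2 = 0.90, D_m = 7 (invented values).
exact = is_potential_ce(0.95, 0.90, 7, 0.95, 0.90, 7)
# A "high-low" candidate: first site better than required, second weaker.
relaxed = is_potential_ce(0.99, 0.80, 8, 0.95, 0.90, 7,
                          Relaxation(r1=0.02, r2=0.26, rd=0.32, rcs=0.20))
```

In the second call the first site scores above its model threshold, so its term is negative and partly offsets the weak second site; this is the "high-low" compensation that the composite score is meant to capture.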
Input and output

To run MatrixCatch, the user should (a) supply DNA
sequence(s) in EMBL, FASTA or plain text format and
(b) define the search stringency. Results are ordered
by p-value. A threshold for the p-value or for the expected
frequency of CEs per 1 kb can optionally be supplied. Raw
p-values (5a) and their correction for multiple testing by the
Bonferroni (5b), Bonferroni step-down
(5c), and Benjamini and Hochberg (5d) procedures are
calculated by the formulas:



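$$p\text{-value} = 1 - (1 - p \cdot q) \qquad (5a)$$

$$\text{corrected } p\text{-value} = p\text{-value} \cdot SequenceLength \qquad (5b)$$

$$\text{corrected } p\text{-value} = p\text{-value} \cdot (SequenceLength - rank\_CE) \qquad (5c)$$

$$\text{corrected } p\text{-value} = p\text{-value} \cdot (SequenceLength / rank\_CE) \qquad (5d)$$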



where $p$ ($q$) is the frequency of the first (second) BS of a
CE found on a random sequence with the minimal PWM
score equal to $S_1$ ($S_2$), and rank_CE is the rank of the CE in
the output list sorted by p-values before correction.
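The three corrections can be sketched as a single Python function. This is an illustration under stated assumptions rather than the program's code: raw p-values are taken as given, ranks are assumed to start at 1 for the smallest raw p-value, corrected values are capped at 1.0 (an assumption not stated in the text), and `correct_p_values` is a hypothetical name.

```python
def correct_p_values(raw, seq_len, method):
    """Correct raw p-values following formulas (5b), (5c) and (5d)."""
    # Rank CEs by raw p-value; rank 1 is the most significant one.
    order = sorted(range(len(raw)), key=lambda i: raw[i])
    corrected = [0.0] * len(raw)
    for rank, i in enumerate(order, start=1):
        if method == "bonferroni":            # (5b)
            factor = seq_len
        elif method == "step_down":           # (5c)
            factor = seq_len - rank
        elif method == "benjamini_hochberg":  # (5d)
            factor = seq_len / rank
        else:
            raise ValueError(f"unknown method: {method}")
        # Capping at 1.0 is an assumption made for this sketch.
        corrected[i] = min(1.0, raw[i] * factor)
    return corrected


# Toy raw p-values for three potential CEs on a 500 bp sequence.
raw = [2e-4, 1e-6, 5e-5]
bonf = correct_p_values(raw, 500, "bonferroni")
step = correct_p_values(raw, 500, "step_down")
bh = correct_p_values(raw, 500, "benjamini_hochberg")
```

The results are returned in the input order, so each corrected value can be attached back to the CE it belongs to regardless of its rank.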

All p-value related parameters, namely the p-value threshold,
the type of p-value correction or the frequency of CEs per
1 kb, can be adjusted after the search in order to refine
the output. MatrixCatch produces a list of potential CEs,
their positions, scores, p-values and respective links to
the original CEs in the database. Graphical visualization
and machine readable output are also provided.

In addition to the preloaded library, users are encouraged
to create, store and search for their own CE models
(please visit the website). To do this a user should select
PWMs from the existing library, specify thresholds,
orientations and interspace distance, and optionally give a
description. Such an element of crowdsourcing allows a
quick integration of novel data and its use by the
community. A single composite regulatory element found in
a specific experiment is already sufficient to be submitted
into the system and used without the need for
programming or the establishment of a separate
website. In acknowledgment of such submissions, users who
use these models in their research are requested to
cite the work of the submitter.

Results 

Comparison with other CE recognition methods 

First, we compared our method to other available
methods for CE prediction. CompelPatternSearch [6] is
based on comparison of the original sequence of a CE with
an investigated sequence. By increasing the number of
allowed nucleotide mismatches in both motifs and the
distance between them, the accuracy of the method can be
adjusted. Another method was specifically developed for
the recognition of the composite element NF-AT/AP-1 [4]
with a score function based on weighted logarithms of
PWM scores and a fixed length of intermediate sequence
from 5 to 11 bp. False positive rates were estimated on
sequences of second exons derived from the human genome,
since they are supposed to comprise no regulatory
elements. In all tests the elements to be recognized were
excluded from the training data. All three methods were
tested on the same dataset by the same procedure.

Receiver operating characteristic (ROC) curves of the
three methods tested on recognition of NF-AT/AP-1 are
shown in Figure 2. ROC curves for another two CEs
(C/EBP/NFkappaB and E2F/Sp1) can be found in Additional
file 1: Figures S1 and S2. These tests show that MatrixCatch
in general outperforms the simple pattern based search
used in CompelPatternSearch. CompelPatternSearch
performs similarly only when used with the most stringent
parameters, i.e. when no mismatches are allowed in both
BSs and length variation is not more than just a few
nucleotides. Relaxing parameters results in a sharp increase of
the false positive rate. Already with >2 allowed mismatches
per BS, CompelPatternSearch becomes practically
unusable due to an extreme number of predictions
(Additional file 1: Figure S1). MatrixCatch performance is
much more tolerant to parameter relaxation. This also
shows that MatrixCatch is less subject to an overtraining
effect, since more knowledge is enclosed in CE
matrix models rather than just in the DNA sequence of a CE.

Figure 2 Receiver Operating Characteristic (ROC) curves of
three methods on recognition of CE NFAT/AP-1.

Unfortunately, many types of CEs are represented by a
single example. In practical applications all are used for
recognition, but for testing at least two known
CEs of the same class are required. Therefore, a
cross-validation for all elements is not feasible. We presented
comparisons for the two classes NF-AT/AP-1 and C/EBP/NFkappaB
that have the highest number of examples.
However, even for smaller classes the good performance of
MatrixCatch is evident (Additional file 1: Figure S2).

Comparisons with statistical methods 

First let us define what we call known, novel and de novo
regulatory elements. By known regulatory elements (both
single sites and pairs) we mean those verified
experimentally. By novel regulatory elements we mean those
identified by any kind of computational comparison but
without experimental verification of functionality. These
elements can be found using similarity to known ones
(then we say novel or potential BS and CE) or solely by
statistical evaluation of motif frequencies in an investigated
dataset (in this case we say de novo motif identification,
for example see [1,10,11]). So, for example,
MatrixCatch uses a library of CE models and hence finds
novel composite elements. CMA and ModuleSearcher use
a library for single sites (PWMs) and find novel single sites
but discover pairs de novo. CisModule discovers single
sites and pairs de novo purely based on statistics. Although
these methods utilize different approaches, from a practical
point of view one would like to know which method(s) to apply
first to, for example, a set of DNA sequences to have the
highest chances of true discovery. In such cases collections
of known elements are commonly used for evaluation of
both library based and de novo methods.

To test the performance of MatrixCatch we selected
well established benchmark datasets [1], and as a
quality measure we chose the nucleotide-level correlation
coefficient (nCC). We preferred nCC over PPV
(positive predictive value), since the latter does not
accurately account for situations when, for example, a
predicted module only slightly overlaps with a real one or is
much longer than a real one. Instead, nCC reflects the
sensitivity and specificity of predictions by counting the
number of correctly predicted nucleotides, i.e. nucleotides
that lie in an overlap of a predicted and a real module
(for exact formula see [1]).
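A nucleotide-level correlation coefficient of this kind can be sketched as follows. The exact formula used in the benchmark is given in [1]; the sketch assumes the commonly used Matthews-type definition over nucleotide-level true/false positives and negatives, and returns 0.0 when the denominator vanishes (an assumption). The names `covered` and `ncc` are hypothetical.

```python
from math import sqrt


def covered(intervals):
    """Set of positions covered by half-open (start, end) intervals."""
    return {pos for start, end in intervals for pos in range(start, end)}


def ncc(predicted, real, seq_len):
    """Nucleotide-level correlation between predicted and real modules."""
    p, r = covered(predicted), covered(real)
    tp = len(p & r)            # nucleotides in both a predicted and a real module
    fp = len(p - r)            # predicted only
    fn = len(r - p)            # real only
    tn = seq_len - len(p | r)  # in neither
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


# Toy example on a 100 bp sequence with one real module at positions 10-19.
perfect = ncc([(10, 20)], [(10, 20)], 100)
half_overlap = ncc([(15, 25)], [(10, 20)], 100)
```

A prediction that overlaps the real module only partly is penalized both for the missed nucleotides and for the extra ones, which is the behavior the text contrasts with PPV.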

The selected benchmark consists of TRANSFAC
matrices related to the composite elements to be
identified, complemented by a number of "noise" matrices
(not related to the CEs). Noise levels correspond to the
number of the additional matrices in a set. The
"noise_99" series comprises all PWMs. MatrixCatch
was run with its default parameters, the entire library of
CE models and with PWM datasets provided by [1] that
correspond to the different noise levels. Reduction in
the PWM library automatically directed MatrixCatch
not to use CE models that comprise missing PWMs.
Results obtained were submitted for evaluation
(http://tare.medisin.ntnu.no/composite/composite.php). Unfortunately,
coMOTIF [11] converged to an equiprobable PWM (all
elements equal 0.25) on all datasets. Other tests showed that
coMOTIF performs better on data consisting of a large
number of shorter sequences (data not shown).

The results of the comparison are presented in Figure 3.
It is evident that MatrixCatch significantly outperforms all
other methods on all datasets. Despite such a good
performance, one should note the different nature of these
methods (de novo identification and library based), and the
results need to be interpreted accordingly.

MatrixCatch was used with the entire CE library. It 
identified all CEs in each of the datasets (data not shown), 
which would indicate a sensitivity of 100%. However, we 
should point out that the identified CEs are the same that 
were used to build the models and MatrixCatch by its def- 
inition always identifies the CEs used to construct the 
models. This is the major difference to comparisons in the 
previous section, where respective CE models were re- 
moved from the CE library. Thus, comparing the sensitiv- 
ity parameter is not fully appropriate here. 

Instead, specificities of the predictions should be compared.
The nCC score is calculated upon all reported CEs and
its higher values in all categories for MatrixCatch indicate
higher specificity. This can be interpreted in such a way
that MatrixCatch not only identified all true CEs in the
dataset but also did not report too many false hits.

Figure 3 Nucleotide level correlation scores (nCC) on the
TRANSCompel dataset. The graphs show nCC scores at
increasing noise levels (noise0, noise50, noise75, noise90,
noise95, noise99). Values for CisModule could be calculated only
for the "noise0" dataset. For further details see (Klepper et al. [1]).

However, if we assume that a dataset contains only
regulatory elements principally different from those in
the library, priority should be given to de novo
identification methods. The practical application of MatrixCatch
presented in the next section shows that the existing
variety of known CEs is already sufficient to outperform
statistical methods in most situations.

Investigation of tissue-specific promoters 

An experimental study of tissue-specific promoters was 
recently performed by [12]. The authors investigated the 
expression of genes triggered by alternative promoters in 
different tissues. They could show that transcription from 
alternative promoters differs significantly in most investi- 
gated cases. Therefore, tissue specific promoters found in 
that study represent a competitive example for bioinfor- 
matics analysis. We will search for potential composite 
regulatory elements similar to known ones using 
MatrixCatch and novel combinations using other pro- 
grams. The key question is which program can identify el- 
ements that are most specific to the dataset of interest. 

Using the data provided by [12], 11 datasets of positive
and negative promoters with a length of 500 bp and 1 kb
that covered regions -400 to +100 and -900 to +100
around the TSS, respectively, were generated (datasets can
be found in Additional file 2). For the discovery of
cis-regulatory modules, methods reviewed by [1] were
selected. Out of eight programs, two are not available to date
(MSCAN and Stubb). Cluster-Buster and Cister could not
be applied, since they require a single sequence as input
rather than a set. MCAST identified very long modules with
many motifs. For instance, in the 500 bp breast dataset
MCAST reported a module 355 bp long with 23 motifs as
a top hit. Though it has a very significant E-value, this result
seems to have little practical use. Finally, only three
programs, CisModule, ModuleSearcher and CMA, in addition
to MatrixCatch were used for the analysis.

The goal was to identify module(s) that can be
found in at least $Min^+$ of positive promoters and in no
more than $Max^-$ of the negative ones. If we denote by $C^+$
and $C^-$ the normalized numbers of positive and negative
promoters comprising a module, then the above can be
formalized as $C^+ \ge Min^+$ and $C^- \le Max^-$. Several values
for $Min^+$ and $Max^-$ were fixed: (0.90, 0.50), (0.75, 0.50),
(0.66, 0.50), (0.50, 0.25), (0.33, 0.15).
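The specificity criterion can be sketched in a few lines of Python. The sketch assumes non-strict thresholds ("at least" and "no more than"), represents each promoter by a boolean that tells whether the module was found in it, and uses the hypothetical names `coverage` and `satisfied_levels`. The negative set in the example is invented.

```python
# The five (Min+, Max-) pairs listed in the text.
LEVELS = [(0.90, 0.50), (0.75, 0.50), (0.66, 0.50), (0.50, 0.25), (0.33, 0.15)]


def coverage(hits):
    """Fraction of promoters that contain the module (hits are booleans)."""
    return sum(hits) / len(hits)


def satisfied_levels(pos_hits, neg_hits, levels=LEVELS):
    """Return every (Min+, Max-) pair met by the module."""
    c_pos, c_neg = coverage(pos_hits), coverage(neg_hits)
    return [(mn, mx) for mn, mx in levels if c_pos >= mn and c_neg <= mx]


# A module found in 16 of 17 positive promoters and, in this invented
# negative set, in 4 of 10 negative promoters.
pos_hits = [True] * 16 + [False]
neg_hits = [True] * 4 + [False] * 6
levels_met = satisfied_levels(pos_hits, neg_hits)
```

Because a module can satisfy several levels at once, the function returns all of them instead of only the most stringent one.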

All programs were run with default parameters except
the following. The number of single PWMs in a module
was set to 2 in CMA, ModuleSearcher and CisModule. In
ModuleSearcher "Number of top scoring modules to
return" was set to 10. CMA was set to output 5 pairs
(maximum allowed) and to optimize the distance of a module.
Both of the above programs used the TRANSFAC library of
PWMs. CisModule does not require PWMs, since it
identifies them during the search. In summary, all programs
were set to find several modules each consisting of a pair
of DNA motifs. Since ModuleSearcher and CisModule
cannot use negative datasets, the results of all three
programs were additionally optimized in order to maximize
the ratio $C^+/C^-$, provided that the boundary conditions for
$C^+$ and $C^-$ hold true. This was done by varying
independently the minimal required scores for both PWMs in a
module, and the one with the highest $C^+/C^-$ is reported as
a hit. MatrixCatch was run with the entire library of CE
models and relaxation parameters were adjusted for
maximum $C^+/C^-$.
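The threshold optimization can be illustrated with a small grid search. This is a sketch under assumptions, not the procedure actually used. Each promoter is represented by a list of candidate (score1, score2) pairs for one module. The thresholds are varied over a fixed grid, and a ratio with $C^-$ equal to zero is treated as infinite. The names `fraction_hit` and `best_thresholds` are hypothetical and all data are invented.

```python
from itertools import product


def fraction_hit(promoters, t1, t2):
    """Fraction of promoters with at least one pair passing both thresholds."""
    hit = sum(any(s1 >= t1 and s2 >= t2 for s1, s2 in pairs)
              for pairs in promoters)
    return hit / len(promoters)


def best_thresholds(pos, neg, grid, min_pos, max_neg):
    """Vary both minimal scores independently and maximize C+/C-."""
    best = None
    for t1, t2 in product(grid, repeat=2):
        c_pos = fraction_hit(pos, t1, t2)
        c_neg = fraction_hit(neg, t1, t2)
        if c_pos < min_pos or c_neg > max_neg:
            continue  # boundary conditions violated
        # Treating C- == 0 as an infinite ratio is an assumption of the sketch.
        ratio = c_pos / c_neg if c_neg > 0 else float("inf")
        if best is None or ratio > best[0]:
            best = (ratio, t1, t2)
    return best


# Invented toy data: four positive and four negative promoters.
pos = [[(0.9, 0.9)], [(0.9, 0.8)], [(0.7, 0.9)], [(0.95, 0.85)]]
neg = [[(0.9, 0.8)], [(0.8, 0.8)], [(0.6, 0.6)], [(0.85, 0.9)]]
result = best_thresholds(pos, neg, grid=[0.7, 0.8, 0.9],
                         min_pos=0.5, max_neg=0.5)
```

On ties the first threshold pair encountered is kept; `None` is returned when no pair of thresholds satisfies the boundary conditions.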

We believe that this determination of the method
performance is straightforward and is most indicative in
real applications. Indeed, no common measures like false
positives, true negatives etc. can be calculated, since
regulatory modules are to be discovered de novo. Tests
on re-discovery of known examples are presented above.

Results of the application of the four methods are
presented in Tables 1, 2 and Additional file 1: Table S1. As
can be seen from Table 1, in each specificity group
MatrixCatch has found modules in more datasets
compared to the other methods. For example, in the group
($C^+ \ge 0.75$ and $C^- \le 0.50$) MatrixCatch found CEs in breast,
heart, kidney and prostate promoters, while CMA and
ModuleSearcher found modules only in prostate promoters.

Out of four methods only MatrixCatch was able to
identify a regulatory element with very high specificity (group
0.90/0.50 in Table 1, CE number 112, relaxation
parameters: $R_1$=0.02, $R_2$=0.26, $R_{CS}$=0.20 and $R_D$=0.32). This CE
could be recognized in 16 out of 17 promoters active in
prostate (p-value 5.624*10'^, promoters and CEs are
graphically represented in Additional file 1: Figure S3). As
was identified in a study of chicken myeloid cells, both
motifs of this CE are bound by C/EBP-related proteins [13]. It
is important to mention that the C/EBP transcription
factor was later found to upregulate metastatic gene
expression in human prostate cancer cells [14,15]. This
demonstrates that MatrixCatch identified highly specific
regulatory elements, the functionality of which was
confirmed by several independent studies. In comparison,
other programs could identify modules only in 13 (CMA,
ModuleSearcher) or 12 (CisModule) promoters. None of
these methods found an element similar to the C/EBP binding
motif. We may speculate that elements reported by
statistical methods may represent some functionality, but no
other support than statistical significance can yet be
presented.

To emphasize the importance of the developed
approach, we should mention that this type of CE is
represented by a single example. As can be seen from Table 3,
newly discovered CEs in prostate promoters do not show
many conserved positions in either motif. Approaches
based on mere pattern matching of the DNA sequence of
the CE itself (as for example, CompelPatternSearch [6])
would produce a huge number of hits, which renders
predictions useless. Matching the motifs independently (as
statistical methods do) will not help to reveal this CE
either, due to the low score of one of the BSs. Indeed,
composite elements in genes NET1, SULF1, MAD1L1,
KIAA1539, SDR39U1 and COL4A6 have one C/EBP site
recognized with a very low PWM score (Table 3).
Nevertheless, the second site, recognized with a high PWM
score, contributes to the overall composite score (3) of the
pair. Thus, in all of the above-mentioned genes the
composite score entailed specific recognition of the regulatory
element.

Deyneko et al. BMC Bioinformatics 2013, 14:241
http://www.biomedcentral.com/1471-2105/14/241

Table 1 Recognition of regulatory elements in tissue specific promoters

Specificity level (Min+/Max-)   0.90/0.50   0.75/0.50   0.66/0.50   0.50/0.25   0.33/0.15
MatrixCatch                     1           4           7           4           5
CMA                             0           1           3           0           1
ModuleSearcher                  0           1           6           1           3
CisModule                       0           0           1           1           2

Number of datasets of tissue specific promoters in which the programs found at least one module with the required level of specificity. The total number of
datasets is 11.

Altogether, using the approach presented here it became
possible to build up a matrix model for a single
example of a C/EBP/C/EBP composite element and use
this model for recognition of new potential regulatory
elements in prostate promoters with high specificity.
Therefore, highly reliable experimental knowledge is not
dismissed due to statistical considerations.

We investigated potential composite elements identified
with specificities $C^+ \ge 0.75$, $C^- \le 0.50$ (in Additional file 1:
Table S1) for their biological relevance. CE NF-kappaB/ATF-1
(relaxation parameters: $R_1$=0.06, $R_2$=0.10, $R_{CS}$=0.10
and $R_D$=0.48) was found specific (0.75/0.391) to promoters
active in breast tissue and was described as an activator of the
interleukin 2 gene [16]. Although neither NF-kappaB nor
ATF-1 per se exhibits any tissue specificity, the
NF-kappaB family has been shown to be active in human breast
cancers [17]. Taking into account that composite elements
often have their own transcriptional function [8], this
element may represent a promising example for further
investigations. Another element, c-Myb/Ets-1 ($R_1$=0.08,
$R_2$=0.10, $R_{CS}$=0.10 and $R_D$=0.28), found in heart specific
promoters, contains Ets-1 as one of the contributing
factors, which has been shown to be expressed during heart
development in mouse [18]. The third element, HNF-4a/HNF-4a,
found in kidney promoters ($R_1$=0.20, $R_2$=0.26,
$R_{CS}$=0.10 and $R_D$=0.76), is known to play a role in
development of the liver, kidney and intestines. Altogether, these
examples show that MatrixCatch is able to identify
potential composite elements that are not only specific, but are
also biologically relevant to the investigated datasets. The
biological knowledge behind these predictions is an important
advantage in comparison to methods based on pure statistics.

An interesting dependence on the input data is shown
by the programs CisModule and ModuleSearcher.
ModuleSearcher identified regulatory modules mostly
in 1 kb promoters, whereas CisModule did so in 500 bp
promoters (in Additional file 1: Table S1). Such behavior may impede
the practical application of these methods, since there is
no agreement on a "proper" length of a promoter.
MatrixCatch is more tolerant towards the input data as
well as to the optimization of parameters. Results in
Additional file 1: Table S1 show that in general
MatrixCatch finds composite modules in many specificity
groups. There are just a few cases when modules that


Table 2 Specificity values of regulatory modules

Dataset (number of seq.)   MatrixCatch   CMA    ModuleSearcher   CisModule
Breast (24)                5.29          1.65   2.90             3.66
Heart (68)                 2.60          -      1.38             -
Kidney (51)                3.47          1.46   2.54             -
Muscle (86)                1.43          -      1.35             -
Pancreas (61)              2.56          -      1.43             -
Prostate (17)              9.54          6.19   2.49             6.54
Thyroid (74)               1.62          -      1.40             -

Highest values of specificity ($C^+/C^-$) shown by the programs in different datasets. None of the programs found modules in the datasets: Cerebellum, Liver, Spleen
and Testis.

Table 3 Composite element in prostate-specific promoters

Name(a)     Gene      Position(b)  Strand  S1(c)  S2(c)  CS(c)  p-value    Sequence

Original composite element sequence: ATGAGGCAAT cggcact GTTGCCACAT
uc002uum.1  MOB4      -346         +       0.972  0.976  0.012  3.801e-06  AG^GCGAAAAT gctgt^ GTirCTOAGAGA
uc003jwu.1  OCLN      -213         +       0.973  0.949  0.202  9.234e-05  AGATTCAGAAACA gc^ccaatg TTTACACACGACT
uc003qcg.1  EPB41L2   -100         +       0.988  0.981  0.099  1.418e-06  AGATTTTGAAATG ctac TmCACAAAATA
uc001iia.1  NET1      -369         +       0.991  0.765  0.207  1.091e-05  ACCTTTGGTAATT ggaaat ATATCTCATATO
uc002eby.1  ZNF843    -352         +       0.963  0.929  0.123  1.895e-04  AGCCTAGGCAAAA ^agcacg ATTCCGTCTCAAG
uc004dpe.1  SHROOM4   +3           +       0.961  0.915  0.140  3.402e-04  TGCTATTGTAAAT ^gaact^ TTTTCTTTCmC

Sequence complementary to the original composite element sequence: ATGTGGCAAC agtgccg ATTGCCTCAT
uc003edg.1  C3orf15   -317         -       0.946  0.959  0.110  1.347e-04  TGGCTGAGAAAAT caatgac ATTGCTTATGAAA
uc003fsb.1  TP63      -345         -       0.924  0.972  0.228  2.397e-04  ACAAAGAGTAAAA agaaaag TTTTCATAAAGGA
uc003gno.1  C1QTNF7   -27          -       0.947  0.997  0.234  4.040e-06  AAACTGAGAAAGA taa CTTTCTGAAATGC
uc003xye.1  SULF1     -333         -       0.728  0.987  0.304  1.456e-04  AAAGAAAGGTAGG ca GTTGCAAAACTO
uc002tah.1  AFF3      -149         -       0.922  0.993  0.046  4.600e-06  TCAGAAGGAAAAA cigrttag ATTTCAAAATGTA
uc003sli.1  MAD1L1    +2           -       0.761  0.981  0.276  1.790e-04  TGTCTAGGGGAGA tooaat CTTGCCTAAGCAA
uc003zwl.1  KIAA1539  -310         -       0.760  0.959  0.300  8.385e-04  CTCCGTAGTCACC agatttt ATTTCACAAGGTG
uc001lwy.1  SLC22A18  -113         -       0.939  0.965  0.167  1.691e-04  CGCTCCCGGAACT tccc/t TTTACATATGAGG
uc001wpn.1  SDR39U1   -12          -       0.767  0.993  0.313  3.045e-05  TTAGTGAGACAAT ggcg ATTGCAAAGCGCG
uc004env.1  COL4A6    -44          -       0.752  0.981  0.285  2.156e-04  TGAGATGGACA^ ttattttt ATTGCCTAAACTG

Composite regulatory element C/EBP / C/EBP recognized in promoters of genes active in prostate tissue. Nucleotides with significant conservation shown in bold
(within binding motifs) and italics (intermediate sequence).
(a) Names according to (Jacox et al., [12]).
(b) Beginning of the element relative to TSS.
(c) S1, S2: PWM scores for the first and second C/EBP motif; CS: composite score.



discriminate positive and negative datasets are found
exclusively in one specificity group, which corresponds to one
specific set of relaxation parameters. For example, modules
found in pancreas and thyroid promoters are probably false hits,
since they can be identified only in the specificity group
(C^ > 0.66, C < 0.50), which may represent an artefact of
parameter optimization. As a rule, if MatrixCatch identifies a
composite module, it can be found in several specificity groups,
which indicates greater tolerance to the search parameters than
other methods show.
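
The rule of thumb stated above, that a genuine module should reappear under several sets of relaxation parameters, can be expressed as a simple support count. The following Python sketch is only an illustration under stated assumptions: the group labels, module names and the `min_groups` threshold are hypothetical and do not reflect the output format of MatrixCatch or of any other program discussed here.

```python
from collections import defaultdict

def group_support(results):
    """Map each module to the sorted list of specificity groups where it was found."""
    support = defaultdict(list)
    for group, modules in results.items():
        for module in modules:
            support[module].append(group)
    return {module: sorted(groups) for module, groups in support.items()}

def split_by_support(results, min_groups=2):
    """Separate modules seen in at least `min_groups` groups from the rest."""
    support = group_support(results)
    robust = sorted(m for m, g in support.items() if len(g) >= min_groups)
    suspect = sorted(m for m, g in support.items() if len(g) < min_groups)
    return robust, suspect

# Hypothetical example: three parameter sets and the modules each one reported.
results = {
    "group_1": {"module_A", "module_B"},
    "group_2": {"module_A"},
    "group_3": {"module_A", "module_C"},
}
robust, suspect = split_by_support(results)
print(robust)   # ['module_A']
print(suspect)  # ['module_B', 'module_C']
```

Modules that end up in the second list are candidates for the kind of parameter-optimization artefact described for the pancreas and thyroid promoters; the threshold of two groups is an arbitrary choice for this illustration.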

Discussion 

Investigation of the transcriptional regulation of genes by
bioinformatic methods is widely used in biomedical research,
and the presented approach contributes to that topic. The
software MatrixCatch is supplied with 265 matrix models of
composite elements, which represents the most comprehensive
collection of known CEs available to date. The program places
no restriction on the size of promoters and is suitable for
examining a single short DNA locus of particular interest as
well as large datasets representing whole genomes. The search
stringency can be easily adjusted via several parameters. The
program was tested for recognition of known composite elements
and compared with other programs on established datasets. In
all cases, MatrixCatch outperformed the other methods. In a
real study of tissue-specific promoters, MatrixCatch identified
a candidate composite element that is specific to promoters
active in prostate, which we offer for further investigation.
Other methods identified hits with much lower specificity, and
for many tissues they were not able to find any.

In the Introduction we pointed out that the problem in
developing CE recognition methods lies in the extremely limited
number of experimentally characterized and documented CEs. We
may speculate that this is a major reason why there is a bias
towards statistical methods rather than methods based on
experimental examples. In addition, many algorithms for the
recognition of particular examples have no software
implementation [3], or the announced web resource is no longer
maintained [5]. To the best of our knowledge, MatrixCatch is
the only ready-to-use application available to date that is
designed for recognition of known composite regulatory elements.

One fundamental question is whether DNA motifs that constitute
a CE and are bound by interacting protein factors are similar
to those bound by the same factors separately. This is an
important issue, since similarity allows the search to be
generalized by recruiting the information available for the
single binding motifs. The similar performance of our method
and the one described in [4] (Figure 2) suggests no or very
minor changes of binding motifs, since the latter method uses
exclusively DNA sequences of CEs for motif recognition. That
method definitely accounts for all kinds of dependencies
between motifs, if any. However, on that principle recognition
methods could be constructed for just a few types of CEs, 2 or
3 at best, since statistics become a critical issue. We can
speculate that some TF binding motifs may differ between single
sites and composite elements, where they are bound by a TF
complex. There are cases when subsets of a specific motif of
single sites appear as constituents of CEs [19]. However, the
data available to date do not provide sufficient experimental
evidence either to support or to reject this. The similar
results of this and the previous method [4] suggest that single
binding motifs are at least not strongly changed, which allows
a method to be built for the recognition of many types of CEs.

The presented approach has the advantage that a matrix model
can be constructed on the basis of any single identified CE,
which will ensure reliable recognition. Thus, the existing
knowledge on combinatorial regulation of transcription, limited
although valuable, can be used for the discovery of similar
regulatory elements in other genes and/or related genes in
different organisms. Together with other methods, both
statistical and library based, MatrixCatch may serve as a basis
for more sophisticated combinatorial analysis of promoters,
enhancers or other regulatory regions, thereby helping to
understand the complex transcriptional regulation of genes and
to reconstruct complete hierarchical regulatory models.
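
To make the idea of a model built from one known CE more concrete, the following Python sketch scans both strands of a sequence for a pair of motifs separated by a constrained spacer. It is a simplified illustration, not the MatrixCatch algorithm: the matrices are derived here from a single example site each (the two motifs listed for the original element in Table 3), the min-max score normalization, the 0.9 cutoff and the spacer tolerance are assumptions, the composite score CS is not computed, and the flanking and spacer bases of the test sequence are arbitrary placeholders. Input is assumed to be upper-case A, C, G and T only.

```python
import math

BASES = "ACGT"

def pwm_from_sites(sites, pseudo=0.5):
    """Build a log-odds matrix (uniform background) from aligned sites of equal length."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = [sum(site[pos] == b for site in sites) + pseudo for b in BASES]
        total = sum(counts)
        pwm.append([math.log2((c / total) / 0.25) for c in counts])
    return pwm

def norm_score(pwm, site):
    """Rescale the matrix score of `site` to [0, 1] (worst to best possible score)."""
    raw = sum(col[BASES.index(base)] for col, base in zip(pwm, site))
    worst = sum(min(col) for col in pwm)
    best = sum(max(col) for col in pwm)
    return (raw - worst) / (best - worst)

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan_strand(seq, pwm1, pwm2, spacer, tol, cutoff):
    """Find motif pairs on one strand whose gap is within `tol` of `spacer`."""
    n1, n2 = len(pwm1), len(pwm2)
    hits = []
    for i in range(len(seq) - n1 + 1):
        s1 = norm_score(pwm1, seq[i:i + n1])
        if s1 < cutoff:
            continue
        for gap in range(max(0, spacer - tol), spacer + tol + 1):
            j = i + n1 + gap
            if j + n2 > len(seq):
                break
            s2 = norm_score(pwm2, seq[j:j + n2])
            if s2 >= cutoff:
                hits.append((i, gap, round(s1, 3), round(s2, 3)))
    return hits

def scan_both_strands(seq, pwm1, pwm2, spacer, tol=1, cutoff=0.9):
    """Return (strand, forward-strand start, gap, s1, s2) for hits on either strand."""
    n1, n2 = len(pwm1), len(pwm2)
    out = [("+", i, gap, s1, s2)
           for i, gap, s1, s2 in scan_strand(seq, pwm1, pwm2, spacer, tol, cutoff)]
    rc = reverse_complement(seq)
    for i, gap, s1, s2 in scan_strand(rc, pwm1, pwm2, spacer, tol, cutoff):
        # Convert the start on the reverse strand to a forward-strand coordinate.
        out.append(("-", len(seq) - (i + n1 + gap + n2), gap, s1, s2))
    return out

# One known example per motif; a 7-nt spacer is assumed between them.
pwm1 = pwm_from_sites(["ATGAGGCAAT"])
pwm2 = pwm_from_sites(["GTTGCCACAT"])
promoter = "CCCC" + "ATGAGGCAAT" + "CGGCACT" + "GTTGCCACAT" + "GGGG"  # placeholder context
hits_fwd = scan_both_strands(promoter, pwm1, pwm2, spacer=7)
hits_rev = scan_both_strands(reverse_complement(promoter), pwm1, pwm2, spacer=7)
print(hits_fwd)  # [('+', 4, 7, 1.0, 1.0)]
print(hits_rev)  # [('-', 4, 7, 1.0, 1.0)]
```

With matrices built from a single site, the normalized score reduces to the fraction of matching positions, so in practice one would substitute matrices derived from many sites of the respective factors. Note also that the two motifs of this element are close to reverse complements of each other, so a looser cutoff (for example 0.8) would start to report the same locus on both strands.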

Conclusions 

Here, we have presented a novel methodology for the
identification of composite regulatory elements in promoter
sequences. The software implementation, MatrixCatch, is
supplied with a library of 265 matrix models used for
recognition, which represents the widest scope of known CEs
available to date. Additionally, this library can be easily
extended via user-supplied models. Investigation of
regularities encoded in known composite elements helped to
improve the specificity of the identification compared with
other methods, as demonstrated on an established benchmark and
real genomic data. Another advantage of the approach is that,
on the basis of any single newly discovered CE, a matrix model
can be constructed and used for recognition. A practical
advantage of this method compared with statistical methods is
the direct biological interpretation of the identified results.